fix(compute): explicit kaniko resources + imagePullPolicy Always by mastermanas805 · Pull Request #2 · InstaNode-dev/api

mastermanas805 · 2026-05-11T08:07:01Z

Summary

Two friction-point fixes that an unaided AI agent hits when redeploying via `/deploy/new`:

1. Kaniko build pod was throttled to 50m CPU / 256Mi RAM

The per-namespace `LimitRange` (hobby default) was the only thing sizing the kaniko container — non-trivial `npm install` hung 5+ minutes. Sets explicit `Requests=250m/256Mi` and `Limits=1/512Mi` on the kaniko container. Build runs as a Job, so the bigger allocation doesn't inflate the app's permanent footprint.

2. `imagePullPolicy: IfNotPresent` + `:latest` tag → silent stale redeploys

When kaniko pushed a new image to GHCR under the same `:latest` tag, k8s served the cached old layer because the tag string was unchanged. Redeploys appeared to succeed while pods kept the prior code. Switched to `PullAlways` everywhere (single-app + stack).

3. `K8sProvider.clientset` is now `kubernetes.Interface`

Drop-in change — `*Clientset` already implements `Interface`. Enables tests to inject `fake.NewSimpleClientset`.

Test plan

`TestKanikoJobHasExplicitResources`: asserts the kaniko container has Requests and Limits set, with the CPU request at the 250m floor
`TestAppDeploymentUsesPullAlways`: asserts `ImagePullPolicy == PullAlways` on the app container
`go build ./...` passes
Follow-up: sha-pin image tags so we can switch back to `PullIfNotPresent` and save bandwidth

🤖 Generated with Claude Code

Two related fixes that an agent hits when redeploying via /deploy/new: 1. Kaniko build pod was sized by the per-namespace LimitRange (hobby tier default: 50m CPU / 256Mi RAM). Non-trivial npm install hung 5+ minutes. Now sets explicit Requests=250m/256Mi and Limits=1/512Mi on the kaniko container. Build runs as a Job, so the higher allocation does not inflate the app's permanent quota. 2. App container had imagePullPolicy=IfNotPresent on the :latest tag. After a redeploy pushed a new image to GHCR, k8s served the cached old layer because the tag string was unchanged. Redeploys silently succeeded while pods kept the prior code. Switched to PullAlways. Also makes the K8sProvider.clientset field a kubernetes.Interface so tests can inject fake.NewSimpleClientset. Tests cover the resource floor + PullAlways guarantee. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s package (#12) User-facing pain that drove this: every domain rename or proxy hostname tweak required scraping the codebase with grep + sed. The instant.dev → instanode.dev rename earlier today touched 28 files and still missed some (we're still finding stragglers). The "Use named constants, not inline strings" memo applies — extracting was overdue. New package internal/urls/urls.go centralises: Public hostnames: PublicAPIBase = "https://api.instanode.dev" PublicMarketingBase = "https://instanode.dev" StartURLPrefix = PublicMarketingBase + "/start" DeploymentWildcard = "deployment.instanode.dev" StoragePublicHost = "s3.instanode.dev" Cluster-internal proxy FQDNs (for in-cluster workloads — friction PR #2): InternalPGProxy = "instant-pg-proxy.instant.svc.cluster.local:5432" InternalRedisProxy = "instant-redis-proxy.instant.svc.cluster.local:6379" InternalMongoProxy = "instant-mongo-proxy.instant.svc.cluster.local:27017" InternalNATSProxy = "instant-nats-proxy.instant.svc.cluster.local:4222" InternalMinIO = "minio.instant-data.svc.cluster.local:9000" Helper: UpgradeStartURL(jwt) — canonical builder for /start?t=<jwt>, replaces 12 sites that had identical fmt.Sprintf calls. Call sites refactored: - 12 fmt.Sprintf upgrade-URL calls across 6 provisioning handlers → urls.UpgradeStartURL(jwtToken) - proxiedInternalURL in handlers/internal_url.go → uses 4 urls.Internal* - upgradeNote / limitExceededNote bare links → urls.StartURLPrefix - canonicalAPIBase (handlers/auth.go) → alias of urls.PublicAPIBase - defaultCanonicalResourceURL (middleware/auth.go) → alias of urls.PublicAPIBase (eliminates the long-standing duplication of the same string in two different files) - 15 user-facing error strings ("Sign up at https://instanode.dev/start") → string concat with urls.StartURLPrefix Audit results: Before: 28+ occurrences of "instanode.dev/start" in non-test handler code After: 1 occurrence (an inline example URL inside the OpenAPI JSON description string — appropriate to leave as a literal since it's sample documentation, not a runtime-constructed URL) Before: 4 proxy FQDN string literals in handlers/internal_url.go After: 0 (all 4 use urls.* constants) Tests: internal/urls/urls_test.go (12 sub-tests): TestPublicHostnames_MatchExpectedShape — guards 5 public hosts TestInternalProxyHostnames_CorrectPortsAndService — guards 5 internal TestUpgradeStartURL_Composition — builder semantics Existing tests still pass: TestProxiedInternalURL — 7 sub-cases unchanged after refactor TestUpgradeNote_DoesNotMentionTrial / TestLimitExceededNote_… — 6 cases TestWhoami_*, TestDeployNew_EnvVars_*, all TestOpenAPI_* — pass Live verification on v2.1.0-urls-package in prod: POST /db/new (fresh) → upgrade host = "instanode.dev" internal_url host = "instant-pg-proxy.instant.svc.cluster.local:5432" note starts with "Works for 24h free" GET /api/v1/whoami → 200 with team_id + plan_tier + team_name (proves the middleware audience lookup still works via the urls.PublicAPIBase alias) Out of scope (separate follow-ups): - Email templates in internal/email/email.go (45 occurrences) — these are marketing copy strings with their own care; need different extraction approach (probably a templated config). - SDK references (sdk-go, sdk-node, mcp) — each is its own repo; consolidating across modules needs a shared common/urls package and cross-repo coordination. - Worker / provisioner repos — they have far fewer URL string instances and can do their own thing. - The OpenAPI bearerAuth description's inline example URL — sample documentation, not a runtime URL. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tract bugs The api Deploy workflow had been red on every run. Root causes, all fixed: Schema-mirror drift (the bulk of the ~25 failures) testhelpers.runMigrations hand-mirrors the prod schema; CI ran against a bare Postgres so any migration that added a table/column without a mirror edit silently broke the gate. Completed the mirror (email_events, pending_deletions, deployment_events, api_keys, magic_links, custom_domains, service_components, uptime_samples + 9 drifted columns) and added TestRunMigrationsMirrorsEveryMigrationTable — a guard that enumerates every CREATE TABLE in the migration files and fails if one is unmirrored. deploy.yml now also applies the real migration files before testing, so CI runs the same schema developers do (kills column drift too). Genuine product/test bugs the lax mirror had been masking: - rate_limit.go / idempotency.go: a nil *redis.Client SIGSEGV'd the whole API on the first request. Both now fail open (CLAUDE.md convention #1). - cache/redis.go provisionLocal/StorageBytes: nil Redis client now returns a clean error → handler 503, never a panic (convention #2). - models.AcceptInvitation: did not clear is_primary when moving an invitee into a team, violating uq_users_one_primary_per_team → 500. Now cleared. - resource_elevate_test.go inserted stack/deployment rows missing the real NOT NULL columns (namespace, app_id); now valid. - Fixed 5 handler tests with stale tier/seat/mock assumptions and dropped the stale -skip list (TestOpenAPI/TestCrossTeam/TestCustomDomainCreate all pass now). go test ./... -short: 25 packages, 0 FAIL, 0 panic. build + vet clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t no-op Two contract-drift P0s flagged in the bug-hunt brief, both shipped in the same PR because they share the test gate. B13-P0-F1: POST /auth/cli auth_url leaked dead-brand * Old: frontendURL() returned "https://instant.dev" in production — instant.dev is the dead-brand marketing host (404s). An agent following the auth_url landed on a parking page and gave up. * Fix: read cfg.DashboardBaseURL (DASHBOARD_BASE_URL env var); fall back to "https://instanode.dev" when DASHBOARD_BASE_URL is unset in production. internal/handlers/cli_auth.go. * Coverage: TestAuth_CLI_ReturnsInstanodeDomain (4 cases — explicit base, empty-fallback, dev default, trailing-slash strip) + TestAuth_CLI_NoLegacyInstantDevInResponses (rule-17 source-scan that fails if a new handler emits instant.dev/<x> anywhere). * Also cleaned: internal/handlers/logs.go (error message URL), internal/handlers/openapi.go (storage endpoint example). B7-P0-1: PATCH /stacks/:slug/env was a silent no-op * Old: handler logged stack.env.noted, returned 200, persisted nothing — the next /stacks/:slug/redeploy rebuilt with stale env. Silent-data-loss failure mode. * Fix (Option A from the brief): migration 062 adds stacks.env_vars JSONB DEFAULT '{}'::jsonb. Handler now loads existing env, merges (empty-string value deletes), validates every key against isValidEnvKey (POSIX [A-Z_][A-Z0-9_]*), persists, emits stack.env.updated audit_log row, and returns the full merged map. * 64KiB cap on serialized payload (ErrStackEnvVarsTooLarge → 413). * OpenAPI updated to describe PATCH semantics + new 409/413 responses + the merged-env response shape. * Coverage: TestStack_PatchEnv_PersistsAndReturns (7 subtests — auth, first patch, merge, delete-via-empty, invalid_env_key, missing_env, 404 unknown slug). All assertions verify the DB round-trip — pre-fix the handler returned 200 but persisted nothing, which is exactly what subtest #2 now fails on. Coverage block per CLAUDE.md rule 17: Symptom: instant.dev/<x> in handler-emitted strings Enumeration: rg -nF "instant.dev/" internal/handlers/ --type go Sites found: 3 (cli_auth.go:339, logs.go:136, openapi.go:2784) Sites touched: 3 — all three migrated to instanode.dev Coverage test: TestAuth_CLI_NoLegacyInstantDevInResponses scans every .go file under internal/handlers/ and fails if any non-test, non-comment, non-label, non-import-path mention of instant.dev/<x> appears. Live verified: pending — see follow-up commit after deploy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mastermanas805 merged commit 082f9fe into master May 11, 2026

mastermanas805 deleted the fix/deploy-compute-correctness branch May 11, 2026 08:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(compute): explicit kaniko resources + imagePullPolicy Always#2

fix(compute): explicit kaniko resources + imagePullPolicy Always#2
mastermanas805 merged 1 commit into
masterfrom
fix/deploy-compute-correctness

mastermanas805 commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mastermanas805 commented May 11, 2026

Summary

1. Kaniko build pod was throttled to 50m CPU / 256Mi RAM

2. `imagePullPolicy: IfNotPresent` + `:latest` tag → silent stale redeploys

3. `K8sProvider.clientset` is now `kubernetes.Interface`

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant